feat: add skip.when conditional column generation #502
Conversation
Adds implementation plan for a `skip_when` field on `SingleColumnConfig` that enables conditional column generation. When the Jinja2 expression evaluates truthy, the cell is set to None and the generator is skipped. Skips auto-propagate through the DAG to downstream columns. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e propagation
- Introduce SkipConfig(when, value) as a nested model on SingleColumnConfig
- Move propagate_skip to SingleColumnConfig as an independent field, fixing a bug where columns with no SkipConfig couldn't participate in propagation
- Add full sync engine implementation (Steps 4a-4d) covering both _fan_out_with_threads and _run_full_column_generator dispatch paths
- Add serialization boundary stripping for both DatasetBatchManager (sync) and RowGroupBufferManager (async)
- Simplify architecture diagrams for readability
- Update all references, design decisions, verification plan
Made-with: Cursor
- Explain why propagation must not use get_upstream_columns() once skip.when adds DAG edges; add _required_columns and get_required_columns() to the execution graph plan
- Point async _run_cell at get_required_columns for parity with sync
- Clarify DropSkippedRowsProcessorConfig vs stripping __skipped__ for DataFrames; tighten resolved-questions wording
- Extend DAG/graph verification with gating_col regression case
Refs #479
Made-with: Cursor
- Document new skip_provenance.py (key constant, read/write/strip API)
- Point sync builder, async scheduler, and batch buffers at shared helpers
- Strip metadata before every DataFrame built from buffer dicts, including FULL_COLUMN active subsets
- Split §3 into skip_evaluator vs skip_provenance; extend verification
Refs #479
Made-with: Cursor
…ng, caller-owns-deserialization
- SkipConfig._validate_when_syntax now checks find_undeclared_variables
is non-empty, rejecting expressions without {{ }} delimiters that
would silently skip every row
- evaluate_skip_when centralizes try/except so both sync and async
engines get identical fail-safe behavior on eval errors
- evaluate_skip_when takes a single pre-deserialized record; caller
runs deserialize_json_values once and passes to both skip eval and
generator (no double deserialization, no redundant parameter)
- Update _should_skip_cell, async _run_cell, Files Modified table,
and verification section accordingly
Refs #479
Made-with: Cursor
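The centralized fail-safe evaluation described above can be sketched as follows. This is a simplified reconstruction, not the PR's actual `skip_evaluator.py`: the class name and fail-safe semantics come from the PR description, while the exact environment wiring is an assumption.

```python
from jinja2 import StrictUndefined
from jinja2.nativetypes import NativeEnvironment
from jinja2.sandbox import SandboxedEnvironment

# Sandboxed evaluation that returns native Python types (a real bool,
# not the string "True"); the MRO picks up NativeEnvironment's code
# generator while keeping SandboxedEnvironment's call restrictions.
class NativeSandboxedEnvironment(SandboxedEnvironment, NativeEnvironment):
    pass

_env = NativeSandboxedEnvironment(undefined=StrictUndefined)

def evaluate_skip_when(when: str, record: dict) -> bool:
    """Evaluate a skip expression against one pre-deserialized record.

    Fail-safe: any evaluation error (undefined variable, type error)
    counts as a skip, so ambiguous rows never reach the generator.
    """
    try:
        return bool(_env.from_string(when).render(**record))
    except Exception:
        return True
```

With `StrictUndefined`, a reference to a missing (or skipped, hence `None`-adjacent) column raises inside `render`, and the `except` turns that into a skip rather than an LLM call.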
Document the _side_effects_by_producer inverse map and the get_side_effect_columns() accessor on ExecutionGraph, needed by _write_skip_to_record / apply_skip_to_record to clear __trace, __reasoning_content, etc. on skip. Added to both the Step 2b metadata section and the Files Modified table. The __skipped__ leak into active_df (greptile's other P1) was already fixed in 7046378 via strip_skip_metadata_from_records. Refs #479 Made-with: Cursor
…479-skip-conditional-gen-implementation
Introduce SkipConfig on SingleColumnConfig to gate column generation with a Jinja2 expression. Columns can be skipped by expression or by upstream propagation (propagate_skip flag).
- SkipConfig: Pydantic model with config-time syntax/delimiter/variable validation and cached column extraction from the Jinja2 AST
- skip_evaluator: runtime expression evaluation via NativeSandboxedEnvironment with fail-safe error handling (skip on expected failures)
- skip_provenance: centralized __skipped__ record tracking shared by sync builder, async scheduler, and buffer managers
- DAG/ExecutionGraph: skip.columns wired as dependency edges in both topological sort and static execution graph
- Validation: validate_skip_references checks reference existence, sampler/seed scope, and allow_resize conflicts
- Sync builder: cell-by-cell and full-column skip with merge-back
- Async scheduler: cell and batch skip with live-buffer provenance
Made-with: Cursor
- Add skip evaluation to _fan_out_with_async (was missing, causing skipped rows to still be sent to the LLM)
- Preserve __skipped__ provenance on non-skipped records after full-column generation so multi-hop propagation works
- Use a single live-buffer reference in the _run_batch skip loop for consistency with _run_cell
- Move Template import to TYPE_CHECKING and reorder import blocks
- Replace O(n²) sum() with itertools.chain in dag.py
- Add set_required_columns/set_propagate_skip/set_skip_config setters to ExecutionGraph for symmetry with the existing API
Made-with: Cursor
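The O(n²) `sum()` replacement mentioned in the commit above is a classic flattening pitfall; a minimal illustration (the variable names are mine, not from dag.py):

```python
import itertools

lists_of_deps = [["a", "b"], ["c"], ["d", "e"]]

# sum(lists, []) rebuilds the accumulator list on every step -> O(n^2)
flattened_quadratic = sum(lists_of_deps, [])

# itertools.chain walks each inner list exactly once -> O(n)
flattened_linear = list(itertools.chain.from_iterable(lists_of_deps))

assert flattened_quadratic == flattened_linear == ["a", "b", "c", "d", "e"]
```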
Add a new recipe demonstrating skip.when patterns (expression gate, propagation, opt-out) with a customer support ticket pipeline. Also extract _should_skip_record in async_scheduler, remove the redundant propagate_skip param from should_skip_by_propagation, and pass a precomputed all_side_effects set through the DAG sort. Made-with: Cursor
Return False when the graph has not been initialized instead of raising, since skip logic cannot apply before generators are set up. Made-with: Cursor
Greptile Summary
This PR adds per-row conditional column generation via SkipConfig(when="...").
| Filename | Overview |
|---|---|
| packages/data-designer-config/src/data_designer/config/base.py | Adds SkipConfig model with Jinja2 syntax validation, columns cached_property for AST-based variable extraction, and propagate_skip field on SingleColumnConfig; model validator correctly blocks self-references, sampler/seed columns, and allow_resize combinations. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/skip_evaluator.py | New file implementing NativeSandboxedEnvironment (correct MRO combining sandbox + native-type output), evaluate_skip_when with fail-safe semantics, and should_skip_column_for_record unified decision function; LRU-cached template compilation is a sound optimization. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/skip_tracker.py | New file implementing the __internal_skipped_columns metadata lifecycle (apply, strip for DataFrame, restore via hidden ID columns); restore-ID mechanism correctly handles row reordering and allow_resize cases with proper bijection validation. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py | Splits _run_full_column_generator into skip-aware and non-skip-aware paths; fan-out methods gain per-row skip evaluation; key issue: _merge_skipped_and_generated uses positional merge (no restore IDs) while the sibling without-skip path is reorder-safe, and update_record in fan-out skip paths is a no-op. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py | _run_cell and _run_batch gain skip-aware paths; cell skip writes directly to live buffer references (correct), batch skip pre-evaluates before building the DataFrame; result_idx alignment with pre_skipped rows is handled correctly. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/execution_graph.py | Stores SkipConfig, propagate_skip, required_columns, and side_effects_by_producer per column with clean accessor methods; set_required_columns correctly stores producer-resolved dependencies only. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/dag.py | Refactors edge-building into _add_dependency_edges helper and correctly adds skip.when column references as DAG edges; all_side_effects pre-computation avoids repeated flattening. |
| packages/data-designer-engine/src/data_designer/engine/validation.py | Adds validate_skip_references covering reference existence, sampler/seed scope, and allow_resize conflicts; the sampler/seed and allow_resize checks are defense-in-depth (model validators fire first) which is acceptable. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/dataset_batch_manager.py | Correctly strips skip metadata from get_current_batch(as_dataframe=True) and the parquet write path; iter_current_batch intentionally yields live buffer references (enabling in-place mutation by skip writers). |
| packages/data-designer-config/src/data_designer/config/column_configs.py | Adds _coerce_skip_value_to_dtype validator on ExpressionColumnConfig that mutates skip.value via Pydantic's non-frozen attribute assignment; bool/int subclass edge case (isinstance(True, int) is True) is minor and unlikely to matter in practice. |
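The bool/int subclass edge case flagged in the last row is easy to demonstrate. The `coerce_strict` helper below is a hypothetical stricter variant, not code from the PR:

```python
# bool is a subclass of int, so an isinstance check for int also accepts True/False
assert isinstance(True, int)

def coerce_strict(value, dtype):
    # exact type match rejects bools posing as ints
    return type(value) is dtype

assert coerce_strict(1, int)
assert not coerce_strict(True, int)
```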
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["_run_full_column_generator(generator)"] --> B{column_name != None\nAND _column_can_skip?}
B -- "No (MultiColumnConfig\nor allow_resize=True)" --> C["_run_full_column_generator_without_skip"]
B -- Yes --> D["_run_full_column_generator_with_skip"]
C --> C1["prepare_records_for_skip_metadata_round_trip\n(injects restore IDs if skip metadata present)"]
C1 --> C2["generator.generate(full_df)"]
C2 --> C3["restore_skip_metadata via restore IDs\n(reorder-safe)"]
C3 --> C4["batch_manager.replace_buffer"]
D --> D1["Iterate all records\nevaluate _should_skip_cell per row"]
D1 --> D2{any row skipped?}
D2 -- No --> C
D2 -- Yes --> D3["_merge_skipped_and_generated\n(positional merge — assumes generator\npreserves row order)"]
D3 --> D4["generator.generate(active_rows_only)"]
D4 --> D5["Merge: skipped rows keep record ref\nactive rows use gen_result + copy prior_skipped"]
D5 --> D6["batch_manager.replace_buffer"]
style D3 fill:#fff3cd,stroke:#856404
```
This is a comment left during a code review.
Path: packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py
Line: 615-641
Comment:
**Positional merge in skip path diverges from restore-ID safety in no-skip path**
`_merge_skipped_and_generated` aligns generator output with `records_with_skip_status` by position (`next(result_iter)`). `_run_full_column_generator_without_skip` — the sibling path for the same generator — now uses restore IDs to handle row reordering across the DataFrame round-trip. The skip path gets no such safety net: if a FULL_COLUMN generator returns rows in a different order than it received them (e.g., a sorting step inside the generator) without `allow_resize=True`, the active-row results are silently assigned to the wrong rows.
The `_run_full_column_generator_without_skip` path already demonstrates how to handle this correctly with `prepare_records_for_skip_metadata_round_trip` / `restore_skip_metadata`. The same restore-ID mechanism could be applied to `active_records` in `_merge_skipped_and_generated` to keep both paths equally robust.
How can I resolve this? If you propose a fix, please make it concise.
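The restore-ID fix proposed in this comment can be sketched with plain dicts. The key name `__restore_id__` and both function names are hypothetical stand-ins for the PR's `prepare_records_for_skip_metadata_round_trip` / `restore_skip_metadata` pair:

```python
RESTORE_ID = "__restore_id__"  # hypothetical hidden column name

def prepare(records: list[dict]) -> list[dict]:
    # tag each active record before handing it to the generator
    return [{**r, RESTORE_ID: i} for i, r in enumerate(records)]

def merge_by_restore_id(original: list[dict], generated: list[dict]) -> list[dict]:
    # realign generator output by ID instead of by position, so a
    # generator that reorders rows cannot silently corrupt the merge
    by_id = {r[RESTORE_ID]: r for r in generated}
    merged = []
    for i, _rec in enumerate(original):
        out = dict(by_id[i])
        out.pop(RESTORE_ID)
        merged.append(out)
    return merged
```

Even if the generator returns rows reversed, merging by ID restores the original order, which positional `next(result_iter)` alignment cannot guarantee.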
---
This is a comment left during a code review.
Path: packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py
Line: 775-789
Comment:
**`update_record` call is a no-op after in-place mutation**
`iter_current_batch()` yields direct references to `self._buffer[i]` (via `enumerate(self._buffer)`). `_write_skip_to_record(column_name, record)` mutates that dict in-place, so `self._buffer[i]` already reflects the skip metadata before `update_record` is called. The subsequent `self.batch_manager.update_record(i, record)` executes `self._buffer[i] = record`, which just re-assigns the same reference — a no-op. Same pattern appears in `_fan_out_with_async` at the corresponding location.
How can I resolve this? If you propose a fix, please make it concise.
Reviews (11): Last reviewed commit: "fix: make expression jinja validator pri…"
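The no-op this comment describes is easy to reproduce with a toy buffer (the `__skipped__` key matches the provenance key named elsewhere in this PR; the rest is illustrative):

```python
buffer = [{"row": 0}, {"row": 1}]

def iter_current_batch(buf):
    # yields direct references to the live buffer dicts, like
    # enumerate(self._buffer) in the manager
    yield from enumerate(buf)

for i, record in iter_current_batch(buffer):
    record["__skipped__"] = ["col"]  # in-place mutation: buffer already updated
    # buffer[i] = record would just re-assign the same reference -> a no-op
    assert buffer[i] is record

assert buffer[0]["__skipped__"] == ["col"]
```

Since the yielded dict *is* the buffer entry, the follow-up `update_record(i, record)` call can simply be dropped.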
Both P1s from Greptile's review are addressed in 1affce9.
Tests added:
Docs preview: https://c435f332.dd-docs-preview.pages.dev
the skip integration tests in test_dataset_builder.py all run through the sync engine. might be good to add at least one that flips …
- Extract shared skip decision logic (_should_skip_cell / _should_skip_record) into should_skip_column_for_record() in skip_evaluator.py so both sync and async engines call the same function (andreatgretel review comment)
- Extend SkipConfig self-reference validation to cover side-effect columns (e.g. review__trace on the review column) — previously only self.name was checked, now self.name | self.side_effect_columns
- Add async engine integration tests for skip paths: cell-by-cell with propagation and full-column batch skip (exercises _run_cell / _run_batch)
- Fix test_allow_resize_column_not_blocked_by_upstream_skip to use the default propagate_skip=True so it actually exercises the allow_resize guard
- Move get_skipped_column_names from skip_tracker to skip_evaluator (its sole production consumer)
Made-with: Cursor
Done in bf6332e — added two async engine skip tests in …
# |---|---|---|
# | **Expression gate** | `skip=dd.SkipConfig(when="...")` | Skip this column when the Jinja2 expression is truthy |
# | **Skip propagation** (default) | Downstream column depends on a skipped column | Automatically skipped too (`propagate_skip=True` by default) |
# | **Propagation opt-out** | `propagate_skip=False` on the downstream column | Always generates, even if an upstream was skipped |
nice examples. small nit: "Always generates" might be a touch too strong - if the column's own skip.when references an upstream value that got skipped (now None), the evaluator's fail-safe kicks in and skips the row anyway. maybe "generates even when upstream columns were skipped" would be more accurate?
andreatgretel left a comment
looking good, approving with nits. I think greptile's comment about allow_resize columns entering the skip branch via propagation is still open and worth addressing before merge.
- test_skip_evaluator: parametrized should_skip_column_for_record covering propagation, expression gates, short-circuiting, and disabled propagation
- test_execution_graph: skip metadata accessors (get_skip_config, should_propagate_skip, get_required_columns, get_side_effect_columns, resolve_side_effect, skip.when DAG edges)
- test_dataset_builder: chained transitive propagation (4 levels), two independent skip gates, custom skip.value, row count preservation
Made-with: Cursor
…ip-conditional-gen-implementation
Made-with: Cursor
# Conflicts:
#	packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py
#	packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/execution_graph.py
Rename assert_expression_valid_jinja to _assert_expression_valid_jinja to match the private naming convention used by other model validators. Made-with: Cursor
efbd5fe
Summary
Adds per-row conditional column generation via SkipConfig(when="..."). Columns can now be skipped based on a Jinja2 expression evaluated against the current row, with automatic downstream propagation and an opt-out mechanism. This avoids expensive LLM calls for rows that don't need a particular column (e.g., complaint analysis only for low-rated reviews).
Changes
Added
- SkipConfig model on SingleColumnConfig — skip=SkipConfig(when="{{ expr }}") with configurable value and propagate_skip flag
- skip_evaluator.py — NativeSandboxedEnvironment for safe expression evaluation with StrictUndefined, plus should_skip_by_propagation for upstream skip detection
- skip_tracker.py — centralized __internal_skipped_columns metadata management on record dicts (apply, read, strip before DataFrame/parquet, round-trip preservation via restore IDs)
- test_skip_config.py, test_skip_evaluator.py, test_skip_tracker.py — unit tests for config validation, expression evaluation, and tracker logic
- Validation: SkipConfig rejected on sampler/seed-dataset columns, rejected with allow_resize, and self-referencing when expressions are caught
- validate_skip_references in validation.py — checks skip.when column references exist, rejects skip on sampler/seed and allow_resize columns
Changed
- Sync engine (dataset_builder.py): _run_full_column_generator now splits into skip-aware and non-skip-aware paths; fan-out methods (_fan_out_with_threads, _fan_out_with_async) evaluate skip per-record, exclude skipped rows from generator input, and merge results back with skip metadata preserved. _column_can_skip fast-path short-circuits when no skip config or propagation applies. Skip metadata is preserved across replace_buffer via the prepare_records_for_skip_metadata_round_trip / restore_skip_metadata restore-ID mechanism.
- Async engine (async_scheduler.py): _run_cell returns a (result, skipped) tuple; _run_batch pre-evaluates skip before building the batch DataFrame; skipped cells report as skipped in progress tracking
- DAG (dag.py): skip.when column references added as dependency edges; edge-building refactored into a shared _add_dependency_edges helper
- Execution graph (execution_graph.py): stores SkipConfig, propagate_skip, required_columns, and side_effects_by_producer per column with accessor methods (get_skip_config, should_propagate_skip, get_required_columns, get_side_effect_columns)
- Buffer managers: strip __internal_skipped_columns metadata before DataFrame construction and parquet writes
- Recipe (2-structured-outputs-and-jinja-expressions): added a conditional generation section demonstrating all three skip patterns (expression gate, propagation, opt-out)
- Docs (dataset-builders.md): documented the conditional generation subsystem
- _run_full_column_generator now correctly preserves row ordering when merging skipped and generated records back together (8931b45)
- _column_can_skip returns False for allow_resize=True columns to prevent row-count mismatches in the skip-aware merge path
- dataset_builder.py — the skip-aware split/merge in _run_full_column_generator_with_skip and the restore-ID mechanism in _run_full_column_generator_without_skip that preserves skip metadata across DataFrame round-trips
- async_scheduler.py — _run_batch now pre-evaluates skip before building the batch DataFrame; _run_cell returns a (result, skipped) tuple
- skip_evaluator.py — fail-safe behavior (skip on evaluation error) is intentional to avoid wasting LLM calls on ambiguous rows
- skip_tracker.py — the prepare_records_for_skip_metadata_round_trip / restore_skip_metadata pair uses hidden restore-ID columns to survive DataFrame round-trips where row order or count may change
Description updated with AI
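The stripping step that keeps provenance out of DataFrames and parquet output can be sketched in a few lines. The key name here follows the `__skipped__` spelling used earlier in this PR (the final code also refers to `__internal_skipped_columns`); the function name mirrors the `strip_skip_metadata_from_records` helper the PR mentions, but this body is an assumption:

```python
SKIP_KEY = "__skipped__"  # provenance key; assumed spelling

def strip_skip_metadata_from_records(records: list[dict]) -> list[dict]:
    # return copies without the provenance key so it never leaks into
    # DataFrames, parquet writes, or user-visible batches; the originals
    # (the live buffer) keep their metadata for multi-hop propagation
    return [{k: v for k, v in r.items() if k != SKIP_KEY} for r in records]
```

Copying rather than mutating is the important design choice: the live buffer must retain `__skipped__` so downstream columns can still propagate skips.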